Project 2: Team Assignment

Authors

Alex Ptacek

Amanda Knudsen

Curtis Elsasser

Sheriann Mclarty

Yana Rabkova

Published

May 13, 2025

Assignment

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of “PH”.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Setup

library(corrplot)
library(caret)
library(earth)
library(e1071)
library(Formula)
library(GGally)
library(ggpubr)
library(glmnet)
library(glue)
library(janitor)
library(knitr)
library(Metrics)
library(plotmo)
library(plotrix)
library(pls)
library(RANN)
library(randomForest)
library(readr)
library(readxl)
library(rpart)
library(rpart.plot)
library(tidyverse)

Load the Data

First we must load the data. We have been provided two datasets. We presumed that one was designated for training (“StudentData.xlsx”) and the other for testing (“StudentEvaluation.xlsx”). However, we found that “StudentEvaluation.xlsx” does not have a target variable. So we concluded that “StudentData.xlsx” is for both training and testing and that “StudentEvaluation.xlsx” is for our final prediction.

train_data <- read_excel("StudentData.xlsx")
test_data <- read_excel("StudentEvaluation.xlsx")

EDA

George Hagstrom, our professor of DATA 607, defined exploratory data analysis as “…the art of looking at data in a systematic way in order to understand the underlying structure of the data. It has two main goals: ensure data quality and uncover patterns to guide future analysis. EDA is detective work: you ask and answer questions, which inspires more questions.” And that is exactly what we will do. We are going to get to know our data by summarizing it, scrutinizing it, examining its correlation properties, and visualizing it.

Summary Information

We will take a look at a sample of the data. We shall also examine our data’s columns and types, the number of rows and columns, and the first few rows of the data. And finally, we will calculate some summary statistics of the data: the mean, median, min, max, and standard deviation of each column so that we may better understand the nature of our data.

head(train_data)
# A tibble: 6 × 33
  `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
  <chr>                <dbl>         <dbl>       <dbl>           <dbl>
1 B                     5.34          24.0       0.263            68.2
2 A                     5.43          24.0       0.239            68.4
3 B                     5.29          24.1       0.263            70.8
4 A                     5.44          24.0       0.293            63  
5 A                     5.49          24.3       0.111            67.2
6 A                     5.38          23.9       0.269            66.6
# ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
#   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
#   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
#   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
#   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
#   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
#   `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, …
str(train_data)
tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
 $ Brand Code       : chr [1:2571] "B" "A" "B" "A" ...
 $ Carb Volume      : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
 $ Fill Ounces      : num [1:2571] 24 24 24.1 24 24.3 ...
 $ PC Volume        : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
 $ Carb Pressure    : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
 $ Carb Temp        : num [1:2571] 141 140 145 133 137 ...
 $ PSC              : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
 $ PSC Fill         : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
 $ PSC CO2          : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
 $ Mnf Flow         : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
 $ Carb Pressure1   : num [1:2571] 119 122 120 115 118 ...
 $ Fill Pressure    : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
 $ Hyd Pressure1    : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
 $ Hyd Pressure2    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
 $ Hyd Pressure3    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
 $ Hyd Pressure4    : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
 $ Filler Level     : num [1:2571] 121 119 120 118 119 ...
 $ Filler Speed     : num [1:2571] 4002 3986 4020 4012 4010 ...
 $ Temperature      : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
 $ Usage cont       : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
 $ Carb Flow        : num [1:2571] 2932 3144 2914 3062 3054 ...
 $ Density          : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
 $ MFR              : num [1:2571] 725 727 735 731 723 ...
 $ Balling          : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
 $ Pressure Vacuum  : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
 $ PH               : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
 $ Oxygen Filler    : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
 $ Bowl Setpoint    : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
 $ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
 $ Air Pressurer    : num [1:2571] 143 143 142 146 146 ...
 $ Alch Rel         : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
 $ Carb Rel         : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
 $ Balling Lvl      : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

We can see there are 33 columns and 2,571 rows in total. The 33 columns include the “PH” column, which is what we will be aiming to predict: PH will be the response variable throughout our modeling.

We can see that the values are mostly numeric. The only character, or categorical, variable is Brand Code. Our team’s preferred guidance on predictive modeling recommends either converting categorical variables into dummy variables or removing them if they are uninformative or have too many categories. Because Brand Code turns out to have significant importance in some of our models, we will convert it to dummy variables before training (models such as Lasso require all input predictors to be numeric).

We can also see that there are negative values, which means we cannot apply the Box-Cox transformation when processing our data. Our guidance and standards, from Applied Predictive Modeling, state that if the data includes negative values we should use the Yeo-Johnson transformation instead.
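To make the distinction concrete, here is a sketch of the Yeo-Johnson transform itself. This is our own illustrative implementation for a fixed lambda; in practice caret’s preProcess() estimates lambda and applies the transform for us.

```r
# Yeo-Johnson is defined piecewise on the whole real line, unlike Box-Cox,
# which requires strictly positive inputs.
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(x[pos])
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-x[!pos])
  }
  out
}

# Works fine on a predictor with negatives, like Mnf Flow.
flow <- c(-100, -52.3, 0.2, 65.2, 120.5, 229.4)
yeo_johnson(flow, lambda = 0.5)
```

The transform is monotone, so it reshapes a skewed distribution without reordering the observations.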

We will also take a look at the summary statistics of our data. We would normally use the summary() function, but its output is wide rather than long and does not present well. So we are going to take a stick-shift approach and compute the statistics ourselves.

df <- select(train_data, -`Brand Code`)
data.frame(
  mean = sapply(df, mean, na.rm = TRUE) |> round(3),
  median = sapply(df, median, na.rm = TRUE) |> round(3),
  min = sapply(df, min, na.rm = TRUE) |> round(3),
  max = sapply(df, max, na.rm = TRUE) |> round(3),
  sd = sapply(df, sd, na.rm = TRUE) |> round(3)
) |>
  kable()
mean median min max sd
Carb Volume 5.370 5.347 5.040 5.700 0.106
Fill Ounces 23.975 23.973 23.633 24.320 0.088
PC Volume 0.277 0.271 0.079 0.478 0.061
Carb Pressure 68.190 68.200 57.000 79.400 3.538
Carb Temp 141.095 140.800 128.600 154.000 4.037
PSC 0.085 0.076 0.002 0.270 0.049
PSC Fill 0.195 0.180 0.000 0.620 0.118
PSC CO2 0.056 0.040 0.000 0.240 0.043
Mnf Flow 24.569 65.200 -100.200 229.400 119.481
Carb Pressure1 122.586 123.200 105.600 140.200 4.743
Fill Pressure 47.922 46.400 34.600 60.400 3.178
Hyd Pressure1 12.438 11.400 -0.800 58.000 12.433
Hyd Pressure2 20.961 28.600 0.000 59.400 16.386
Hyd Pressure3 20.458 27.600 -1.200 50.000 15.976
Hyd Pressure4 96.289 96.000 52.000 142.000 13.123
Filler Level 109.252 118.400 55.800 161.200 15.698
Filler Speed 3687.199 3982.000 998.000 4030.000 770.820
Temperature 65.968 65.600 63.600 76.200 1.383
Usage cont 20.993 21.790 12.080 25.900 2.978
Carb Flow 2468.354 3028.000 26.000 5104.000 1073.696
Density 1.174 0.980 0.240 1.920 0.378
MFR 704.049 724.000 31.400 868.600 73.898
Balling 2.198 1.648 -0.170 4.012 0.931
Pressure Vacuum -5.216 -5.400 -6.600 -3.600 0.570
PH 8.546 8.540 7.880 9.360 0.173
Oxygen Filler 0.047 0.033 0.002 0.400 0.047
Bowl Setpoint 109.327 120.000 70.000 140.000 15.303
Pressure Setpoint 47.615 46.000 44.000 52.000 2.039
Air Pressurer 142.834 142.600 140.800 148.200 1.212
Alch Rel 6.897 6.560 5.280 8.620 0.505
Carb Rel 5.437 5.400 4.960 6.060 0.129
Balling Lvl 2.050 1.480 0.000 3.660 0.870

Missing Values (NA)

Let’s quantify how much of our data is missing.

count_missing <- function(data) {
  data |>
    summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(
      everything(),
      names_to = "variable",
      values_to = "missing"
    ) |>
    filter(missing > 0) |>
    mutate(
      ratio = missing / nrow(data)
    ) |>
    arrange(desc(missing))
}

count_missing(train_data) |>
  kable()
variable missing ratio
MFR 212 0.0824582
Brand Code 120 0.0466744
Filler Speed 57 0.0221704
PC Volume 39 0.0151692
PSC CO2 39 0.0151692
Fill Ounces 38 0.0147802
PSC 33 0.0128355
Carb Pressure1 32 0.0124465
Hyd Pressure4 30 0.0116686
Carb Pressure 27 0.0105018
Carb Temp 26 0.0101128
PSC Fill 23 0.0089459
Fill Pressure 22 0.0085570
Filler Level 20 0.0077791
Hyd Pressure2 15 0.0058343
Hyd Pressure3 15 0.0058343
Temperature 14 0.0054454
Oxygen Filler 12 0.0046674
Pressure Setpoint 12 0.0046674
Hyd Pressure1 11 0.0042785
Carb Volume 10 0.0038895
Carb Rel 10 0.0038895
Alch Rel 9 0.0035006
Usage cont 5 0.0019448
PH 4 0.0015558
Mnf Flow 2 0.0007779
Carb Flow 2 0.0007779
Bowl Setpoint 2 0.0007779
Density 1 0.0003890
Balling 1 0.0003890
Balling Lvl 1 0.0003890
count_missing(test_data) |>
  kable()
variable missing ratio
PH 267 1.0000000
MFR 31 0.1161049
Filler Speed 10 0.0374532
Brand Code 8 0.0299625
Fill Ounces 6 0.0224719
PSC 5 0.0187266
PSC CO2 5 0.0187266
PC Volume 4 0.0149813
Carb Pressure1 4 0.0149813
Hyd Pressure4 4 0.0149813
PSC Fill 3 0.0112360
Oxygen Filler 3 0.0112360
Alch Rel 3 0.0112360
Fill Pressure 2 0.0074906
Filler Level 2 0.0074906
Temperature 2 0.0074906
Usage cont 2 0.0074906
Pressure Setpoint 2 0.0074906
Carb Rel 2 0.0074906
Carb Volume 1 0.0037453
Carb Temp 1 0.0037453
Hyd Pressure2 1 0.0037453
Hyd Pressure3 1 0.0037453
Density 1 0.0037453
Balling 1 0.0037453
Pressure Vacuum 1 0.0037453
Bowl Setpoint 1 0.0037453
Air Pressurer 1 0.0037453

As we can see, there are a fair number of missing values. PH’s 4 missing values may not top the charts; nonetheless, they may be the most problematic. The guidance for this scenario is to remove the rows where PH is null from the training data, both predictors and response. We will explain more when we reach that step, prior to training our models.

Distribution

We are temporarily recoding Brand Code to be numeric so that we can pivot our dataset into long format. And since we already have a handle on the missing values, we filter them out here.

train_data |>
  mutate(
    `Brand Code` = recode(
      `Brand Code`,
      "A" = 1,
      "B" = 2,
      "C" = 3,
      "D" = 4
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "values"
  ) |>
  filter(!is.na(values)) |>
  ggplot(aes(x = values)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ variable, ncol = 4, scales = "free") +
  labs(
    title = "Distribution of Data",
    x = "Values",
    y = "Count"
  )

One Hot Encoding

Before we continue exploring, we are going to convert our categorical variable (Brand Code) into multiple indicator variables using one-hot encoding. We are doing it now because we are about to examine correlation and we want to include Brand Code in that study. We will store the results in intermediate variables named train_data_pp and test_data_pp.

model <- dummyVars(~ ., data = train_data)
train_data_pp <- predict(model, newdata = train_data) |>
  as.data.frame()
test_data_pp <- predict(model, newdata = test_data) |>
  as.data.frame()

Correlation

A correlation matrix helps us understand the relationships among predictor variables. We are leaving the target (PH) in the dataset so that we can see which predictors it correlates with, and so that we don’t remove predictors with which it has strong relationships.

cor_matrix <- cor(train_data_pp, use = "pairwise.complete.obs")
corrplot::corrplot(cor_matrix, order = "hclust", tl.cex = 0.7)

And let’s identify those that we may want to remove because they are highly correlated. We will list variables with \(\ge .80\) correlation.

cor_matrix |>
  as.data.frame() |>
  rownames_to_column("variable") |>
  pivot_longer(
    -variable,
    names_to = "correlation",
    values_to = "value"
  ) |>
  mutate(
    abs_value = abs(round(value, 2))
  ) |>
  filter(
    variable != correlation & abs_value >= 0.8
  ) |>
  arrange(desc(abs_value)) |>
  kable()
variable correlation value abs_value
Balling Balling Lvl 0.9782085 0.98
Balling Lvl Balling 0.9782085 0.98
Density Balling 0.9551553 0.96
Balling Density 0.9551553 0.96
Filler Level Bowl Setpoint 0.9489345 0.95
Density Balling Lvl 0.9479195 0.95
Bowl Setpoint Filler Level 0.9489345 0.95
Balling Lvl Density 0.9479195 0.95
Filler Speed MFR 0.9297119 0.93
MFR Filler Speed 0.9297119 0.93
Alch Rel Balling Lvl 0.9258951 0.93
Balling Lvl Alch Rel 0.9258951 0.93
Hyd Pressure2 Hyd Pressure3 0.9248823 0.92
Hyd Pressure3 Hyd Pressure2 0.9248823 0.92
Balling Alch Rel 0.9247934 0.92
Alch Rel Balling 0.9247934 0.92
Brand CodeD Alch Rel 0.8986189 0.90
Density Alch Rel 0.9027398 0.90
Alch Rel Brand CodeD 0.8986189 0.90
Alch Rel Density 0.9027398 0.90
Alch Rel Carb Rel 0.8435849 0.84
Carb Rel Alch Rel 0.8435849 0.84
Carb Rel Balling Lvl 0.8447733 0.84
Balling Lvl Carb Rel 0.8447733 0.84
Density Carb Rel 0.8243548 0.82
Balling Carb Rel 0.8224442 0.82
Carb Rel Density 0.8243548 0.82
Carb Rel Balling 0.8224442 0.82
Carb Pressure Carb Temp 0.8141717 0.81
Carb Temp Carb Pressure 0.8141717 0.81

Linear View

We would like to better understand the data and line plots can be invaluable. But our data is not a time series. Nonetheless, we think that it’s safe to assume that the observations were made over time. So, we will try plotting it with a line plot and see what patterns, if any, surface. We would like to emphasize that in no way are we suggesting that this should be interpreted as a time series. We are simply viewing it through the lens of a line plot.

We will use pivot_longer() to reshape the data so that we can plot it one variable stacked on top of another. We will also use row_number() to create an index for the x-axis. We will use facet_wrap() to create a separate plot for each variable.

train_data_pp |>
  mutate(
    index = row_number()
  ) |>
  relocate(
    index,
    .before = everything()
  ) |>
  pivot_longer(
    cols = -1,
    names_to = "variable",
    values_to = "values"
  ) |>
  ggplot(aes(x = index, y = values)) +
  geom_line() +
  facet_wrap(~ variable, ncol = 1, scales = "free") +
  labs(
    title = "Line Plot of Variables",
    x = "Index",
    y = "Values"
  )

PCA Study

We employed Principal Component Analysis (PCA) as a dimensionality reduction technique to better understand the structure of the predictor space and to address potential multicollinearity among variables. This will also give us our first glimpse at important predictors and related predictors, and we may recognize these patterns in our models, as well.

Given that the final modeling objective is to predict pH, PCA was applied strictly to the predictor variables (i.e., the input features), with the target variable excluded during the transformation phase.

Methodology

The PCA was conducted using the caret and tidyverse packages in R. The procedure included the following key steps:

  1. Data Preparation: All non-numeric variables were excluded. Only numeric predictors were retained for PCA.

  2. Standardization: Each feature was centered and scaled to unit variance to ensure that PCA was not biased toward features with larger scales.

  3. PCA Transformation: Principal components were extracted from the standardized predictor matrix.

  4. Component Retention: The number of components to retain was informed by a combination of the Kaiser criterion (eigenvalues > 1), cumulative variance explained, and visual inspection via a scree plot.

# Load data
data <- read_excel("StudentData.xlsx")

# Extract only numeric predictors, exclude target
predictors <- data %>%
  select(-PH) %>%
  select(where(is.numeric))

# Preprocess: standardize and apply PCA
pca_prep <- preProcess(predictors, method = c("YeoJohnson", "center", "scale", "pca"))

# Transform data using PCA
pca_transformed <- predict(pca_prep, predictors)

# Attach target variable back
pca_final <- bind_cols(pca_transformed, PH = data$PH)

Results and Interpretation

Variance Explained

The PCA transformation resulted in a series of orthogonal components that capture the variance in the original feature space. The cumulative variance explained by the principal components is shown below:

# Extract variance explained
var_explained <- pca_prep$std^2
cumulative_variance <- cumsum(var_explained / sum(var_explained))

# Print table
tibble(
  PC = paste0("PC", seq_along(cumulative_variance)),
  Variance = round(var_explained / sum(var_explained), 3),
  Cumulative = round(cumulative_variance, 3)
) |>
  kable()
PC Variance Cumulative
PC1 0.000 0.000
PC2 0.000 0.000
PC3 0.000 0.000
PC4 0.000 0.000
PC5 0.000 0.000
PC6 0.000 0.000
PC7 0.000 0.000
PC8 0.000 0.000
PC9 0.000 0.000
PC10 0.000 0.000
PC11 0.000 0.000
PC12 0.000 0.000
PC13 0.000 0.000
PC14 0.000 0.000
PC15 0.000 0.000
PC16 0.000 0.000
PC17 0.006 0.006
PC18 0.000 0.006
PC19 0.000 0.006
PC20 0.994 1.000
PC21 0.000 1.000
PC22 0.000 1.000
PC23 0.000 1.000
PC24 0.000 1.000
PC25 0.000 1.000
PC26 0.000 1.000
PC27 0.000 1.000
PC28 0.000 1.000
PC29 0.000 1.000
PC30 0.000 1.000
PC31 0.000 1.000

From this output, we observe that nearly all of the variance is attributed to a single component, which is suspicious. The cause is that pca_prep$std does not hold the component standard deviations: preProcess() stores the centering/scaling standard deviations of the original predictors in that slot, so the table above reflects the raw scales of the predictors, not the PCA variance decomposition. With standardized predictors we would expect the variance to be spread across many components, with the first several capturing the largest shares, so this table should be read with that caveat in mind.
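Because caret::preProcess() does not expose the component standard deviations, the variance explained is better computed directly with prcomp() on the imputed, standardized predictors. A minimal sketch on toy data (swap in our `predictors` data frame in practice; medianImpute here is just the simplest built-in imputation):

```r
library(caret)
set.seed(31415)

# Toy stand-in for the numeric predictor matrix.
toy <- data.frame(
  a = rnorm(100),
  b = rnorm(100, sd = 50),   # much larger raw scale
  c = rnorm(100)
)
toy$a[sample(100, 5)] <- NA

# Impute and standardize first, then run prcomp() to get the component
# standard deviations that preProcess() does not expose.
pp <- preProcess(toy, method = c("medianImpute", "center", "scale"))
pca_fit <- prcomp(predict(pp, toy))

var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
cumulative <- cumsum(var_explained)
```

The `var_explained` and `cumulative` vectors then drive the same table and scree plot shown above.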

Scree Plot

A scree plot was generated to visually inspect the point of diminishing returns, or the “elbow”, in the variance explained:

# Scree plot
tibble(
  pc = seq_along(var_explained),
  prop_var = var_explained / sum(var_explained)
) |>
  ggplot(aes(x = pc, y = prop_var)) +
  geom_line() +
  labs(
    title = "Scree Plot of Principal Components",
    x = "Principal Component",
    y = "Proportion of Variance Explained"
  ) +
  theme_minimal()

This plot helps determine the optimal number of PCs to retain. Components beyond the elbow contribute marginally to the variance and may be excluded from further modeling.
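In caret, this retention rule can also be expressed directly when preprocessing: preProcess() keeps enough components to reach a cumulative variance threshold (thresh) or a fixed count (pcaComp). A minimal sketch on synthetic data, not our beverage measurements:

```r
library(caret)
set.seed(31415)

# Toy data: a and b are nearly collinear, so one component carries them both.
n <- 100
x1 <- rnorm(n)
toy <- data.frame(
  a = x1 + rnorm(n, sd = 0.1),
  b = x1 + rnorm(n, sd = 0.1),
  c = rnorm(n)
)

# Keep enough components to explain 95% of the variance.
pp <- preProcess(toy, method = c("center", "scale", "pca"), thresh = 0.95)
scores <- predict(pp, toy)
ncol(scores)  # number of components retained to reach 95% variance
```

Here two components suffice, because the shared direction of `a` and `b` collapses into a single component.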

Loadings and Interpretability

The rotation matrix provides the loadings of each original variable on the principal components. Loadings close to ±1 indicate strong influence, while values near 0 indicate minimal contribution.

# Loadings (rotation matrix)
loadings <- pca_prep$rotation
head(loadings)
                      PC1          PC2          PC3         PC4         PC5
Carb Volume    0.31531471 -0.143976538  0.045488184 -0.07189120  0.13600322
Fill Ounces   -0.02804574  0.017940116 -0.133090802 -0.20282595  0.13079067
PC Volume     -0.07278973  0.099647400  0.178824163  0.38781808 -0.19510489
Carb Pressure  0.20619275 -0.071221537  0.038921977 -0.19353924 -0.53182292
Carb Temp      0.03636530  0.003945122  0.021207343 -0.16845217 -0.67331557
PSC           -0.03315238  0.010718550 -0.008134559  0.01046622  0.06452204
                       PC6          PC7         PC8         PC9        PC10
Carb Volume   -0.021760941  0.045102276 -0.12829311  0.01041777  0.03311752
Fill Ounces    0.216426729  0.065503533 -0.39709476  0.07464728 -0.26602657
PC Volume     -0.153586014  0.214932499 -0.16145642 -0.15586733  0.08246535
Carb Pressure  0.254292467  0.016808217 -0.02044590 -0.05997268 -0.04167920
Carb Temp      0.287861663 -0.002024995  0.06830562 -0.06607143 -0.05970436
PSC            0.007356659  0.568172967 -0.16528276 -0.60294687  0.06486758
                      PC11         PC12         PC13         PC14         PC15
Carb Volume   -0.026215760 -0.023768427 -0.013568408  0.010221154  0.035607116
Fill Ounces   -0.690829242  0.068393008 -0.175251843 -0.279255279 -0.101276503
PC Volume      0.023553546 -0.002075413 -0.176795649 -0.145182442 -0.327119387
Carb Pressure -0.005858524  0.030806141 -0.009760717  0.015166859  0.007609565
Carb Temp      0.013429257  0.045898580 -0.007002776  0.011706676 -0.026410335
PSC            0.130111885  0.133297403 -0.288414643  0.004157234  0.114045279
                      PC16         PC17         PC18
Carb Volume   -0.004101528 -0.003973328 -0.007212411
Fill Ounces    0.132783984 -0.014254007 -0.082699636
PC Volume     -0.176523379  0.083741443 -0.513324701
Carb Pressure  0.001270145 -0.013424233  0.003461670
Carb Temp     -0.008136387 -0.006426468  0.016104498
PSC            0.134943257 -0.022685155  0.302412691

top_loadings <- function(loadings, pc, n = 10) {
  # Top n predictors by loading on principal component `pc`.
  pc_col <- paste0("PC", pc)
  loadings |>
    as.data.frame() |>
    rownames_to_column(var = paste0("Predictor", pc)) |>
    as_tibble() |>
    arrange(desc(.data[[pc_col]])) |>
    head(n) |>
    select(1, all_of(pc_col))
}

bind_cols(
  top_loadings(loadings, 1),
  top_loadings(loadings, 2),
  top_loadings(loadings, 3),
  top_loadings(loadings, 4)
) |>
  arrange(Predictor1, Predictor2, Predictor3, Predictor4) |>
  as_tibble()
# A tibble: 10 × 8
   Predictor1         PC1 Predictor2     PC2 Predictor3    PC3 Predictor4    PC4
   <chr>            <dbl> <chr>        <dbl> <chr>       <dbl> <chr>       <dbl>
 1 Alch Rel        0.359  Pressure V… 0.264  Carb Flow  0.218  Hyd Press… 0.402 
 2 Balling         0.340  Carb Flow   0.177  Hyd Press… 0.130  Hyd Press… 0.295 
 3 Balling Lvl     0.356  Filler Lev… 0.263  PC Volume  0.179  PC Volume  0.388 
 4 Bowl Setpoint   0.118  Hyd Pressu… 0.0435 Carb Rel   0.0869 MFR        0.0936
 5 Carb Pressure   0.206  Temperature 0.0762 Hyd Press… 0.0877 Filler Sp… 0.0949
 6 Carb Rel        0.348  Bowl Setpo… 0.260  Fill Pres… 0.159  Filler Le… 0.307 
 7 Carb Volume     0.315  PC Volume   0.0996 Temperatu… 0.107  Hyd Press… 0.232 
 8 Density         0.345  Oxygen Fil… 0.205  Oxygen Fi… 0.133  Bowl Setp… 0.299 
 9 Filler Level    0.112  Air Pressu… 0.0388 Alch Rel   0.0849 Oxygen Fi… 0.0343
10 Pressure Vacuum 0.0639 PSC Fill    0.0198 Pressure … 0.0716 Carb Rel   0.0281

By examining the loading structure:

We can interpret PC1 as a linear combination emphasizing Alch Rel, Balling Lvl, Carb Rel, Density, and Balling, the variables that describe the beverage’s composition and carbonation.

Components with clear thematic groupings (e.g., all chemistry variables, or all environmental sensors) enhance interpretability and may suggest latent structures in the data.

Preprocessing

We have discovered much and have some work to do. Some preprocessing will be applied to our training and testing datasets and shared by all regression models. Other preprocessing, such as centering, scaling, and PCA, will be applied more selectively, per model. To recap what we found:

  • We have a fair amount of correlation.
  • We have NA’s in our target variable.
  • We have predictors with varying amounts of missing data.
  • We have a categorical variable.
  • And we have some data that is not NA, but looks as if it should be.

First, we have already dealt with our categorical data in train_data_pp and test_data_pp. We shall use them as a starting place for further preprocessing.

We will deal with our missing PH data. We don’t want to train our models to predict NA. So we are going to drop the 4 rows missing PH values.

train_data_pp <- train_data_pp |>
  filter(!is.na(`PH`))
# test_data_pp is already void of PH values

And now we will remove highly correlated variables. We will use a correlation threshold of 0.95. While 0.75 is a common starting point, retaining more features helps maintain signal for models like PLS, which can internally handle correlated predictors by extracting latent factors.

cor_matrix <- train_data_pp |>
    select(-PH) |>
    cor(use = "pairwise.complete.obs")
# findCorrelation() returns positions within cor_matrix, which excludes PH,
# so numeric indices would be misaligned with our data frames; asking for
# column names keeps the removal aligned.
high_corr <- findCorrelation(cor_matrix, cutoff = 0.95, names = TRUE)
train_data_pp <- select(train_data_pp, -all_of(high_corr))
test_data_pp <- select(test_data_pp, -all_of(high_corr))

For missing data we will impute, but the suggested imputation methods vary by model, so we are going to leave that processing to be done at training time.
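As a small illustration of the imputation we are deferring (toy values, not our data): caret’s knnImpute fills each missing entry from its k nearest complete rows, and note that it centers and scales the data as a side effect. It relies on the RANN package, which we load in Setup.

```r
library(caret)

# Toy data frame with one missing value.
toy <- data.frame(
  a = c(1.0, 1.1, 0.9, NA, 5.0),
  b = c(2.0, 2.1, 1.9, 2.0, 9.0)
)

# knnImpute replaces the NA with the average of the k nearest rows,
# measured on the centered/scaled predictors.
pp <- preProcess(toy, method = "knnImpute", k = 2)
imputed <- predict(pp, toy)
```

Because row 4 sits near rows 1–3 on `b`, its missing `a` is filled from those neighbors rather than from the outlying row 5.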

Our line plots revealed some data that is suspiciously constant over many consecutive observations: Mnf Flow, Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3. We suspect these stretches are effectively missing data. We considered three options:

  1. Leave them as they are.
  2. Remove the variables altogether.
  3. Set the suspicious observations to NA and impute them.

We chose the third option: recode the suspect values as NA so that they are imputed along with the genuinely missing data.

train_data_pp <- train_data_pp |>
  mutate(
    # convert the NAs encoded as numbers into proper NAs.
    `Mnf Flow` = if_else(`Mnf Flow` < 0, NA, `Mnf Flow`),
    # For those with legitimate 0 values, this is flawed.
    `Hyd Pressure1` = if_else(`Hyd Pressure1` == 0, NA, `Hyd Pressure1`),
    `Hyd Pressure2` = if_else(`Hyd Pressure2` == 0, NA, `Hyd Pressure2`),
    `Hyd Pressure3` = if_else(`Hyd Pressure3` == 0, NA, `Hyd Pressure3`)
  )

Regression Workshop

This is our little workshop of utilities. You will see below that preprocess() does not actually partition our data. Because our dataset is fairly small, we decided to use cross-validation to measure the performance of our models; it trains and tests on many different combinations of the data. Partitioning the training dataset would compromise us in one of two mutually exclusive ways. If we partitioned and did not use cross-validation, we would train each model on only one snapshot of our data, which could lead to bias. If we partitioned and did use cross-validation, we would be training on an even smaller dataset. We feel we have everything to gain by not partitioning the data and nothing to lose.

preprocess() is a simple utility that sets the random seed in hopes that all models will use the same cross-validation folds. It extracts y from data and applies the given preprocessing methods to the predictors. It returns a list containing the preProcess model, the preprocessed predictors, and the response variable.

preprocess <- function(data, methods) {
  set.seed(31415)

  # separate the data into predictors and targets
  X <- dplyr::select(data, -PH)
  y <- data$PH

  # preprocess the data
  model <- preProcess(X, method = methods)
  X <- predict(model, X)
  list(
    PPM = model,
    X = X,
    y = y
  )
}

report() reports accuracy with MAE, RMSE, and R2, averaged across the cross-validation results for every candidate tuning parameter. It also prints the best tuning parameters and the most important variables in descending order.

report <- function(model) {
  mae = mean(model$results$MAE)
  rmse = mean(model$results$RMSE)
  r2 = mean(model$results$Rsquared)
  glue("{model$method}: MAE={round(mae, 2)}, RMSE={round(rmse, 2)}, R2={round(r2, 2)}") |>
    print()
  glue("Best tuned parameters: {colnames(model$bestTune)} = {model$bestTune}") |>
    print()
  print(varImp(model))
}

ctrl is our directive to the training process to use cross validation to train and evaluate our models. All models will use this control method.

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

Linear Regression

Refined Preprocessing

data <- preprocess(
  train_data_pp,
  methods = c("knnImpute", "YeoJohnson", "center", "scale")
)

Train Partial Least Squares (PLS) Model

pls_model <- train(
  x = data$X,
  y = data$y,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl
)
report(pls_model)
pls: MAE=0.11, RMSE=0.14, R2=0.38
Best tuned parameters: ncomp = 20
pls variable importance

  only 20 most important variables shown (out of 34)

                          Overall
`\\`Mnf Flow\\``           100.00
`\\`Brand Code\\`C`         80.48
`\\`Usage cont\\``          64.55
`\\`Bowl Setpoint\\``       64.05
`\\`Filler Level\\``        54.85
`\\`Hyd Pressure3\\``       52.32
Temperature                 49.47
`\\`Pressure Setpoint\\``   49.16
`\\`Hyd Pressure2\\``       46.01
`\\`Brand Code\\`B`         43.27
`\\`Fill Pressure\\``       40.36
`\\`Hyd Pressure1\\``       36.08
`\\`Pressure Vacuum\\``     35.67
`\\`Carb Pressure1\\``      33.63
`\\`Carb Flow\\``           33.53
`\\`Brand Code\\`D`         30.95
`\\`Brand Code\\`A`         29.53
`\\`Oxygen Filler\\``       29.38
`\\`Hyd Pressure4\\``       23.45
`\\`Carb Rel\\``            23.19

Train Ordinary Least Squares (OLS) Model

Ordinary Least Squares (OLS) is a benchmark linear modeling method. We have already preprocessed the data in a way that suits both PLS and OLS. To determine which of the two performs best among our linear regressions, let’s fit an OLS model.

ols_model <- train(
  x = data$X,
  y = data$y,
  method = "lm",
  trControl = ctrl
)
report(ols_model)
lm: MAE=0.1, RMSE=0.13, R2=0.4
Best tuned parameters: intercept = TRUE
lm variable importance

  only 20 most important variables shown (out of 33)

                    Overall
`Mnf Flow`          100.000
`Carb Pressure1`     61.468
Temperature          38.469
`Usage cont`         37.312
`Bowl Setpoint`      34.030
`Brand Code`C        32.573
`Hyd Pressure3`      31.683
Density              29.763
`Pressure Setpoint`  26.424
`Oxygen Filler`      26.261
`Brand Code`A        18.418
`Alch Rel`           16.912
`Carb Flow`          13.997
`Filler Level`       12.102
`Fill Ounces`        11.678
`PSC CO2`            11.650
`Hyd Pressure2`      11.131
`Carb Rel`           10.193
`PC Volume`           9.572
`PSC Fill`            8.811

OLS is interpretable and useful as a baseline. It assumes linear relationships and independence. It may not perform as well as more complex models if predictors are collinear or relationships are nonlinear.
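Before leaning on individual OLS coefficients, a quick collinearity screen tells us how much to trust them. A minimal sketch, assuming `data$X` is the preprocessed predictor matrix used above:

```r
# Flag highly correlated predictor pairs before interpreting OLS coefficients.
# findCorrelation() is caret's helper for identifying redundant columns.
num_cols <- sapply(data$X, is.numeric)
cor_mat <- cor(data$X[, num_cols], use = "pairwise.complete.obs")
findCorrelation(cor_mat, cutoff = 0.9, names = TRUE)
```

Any names returned are candidates to watch: their coefficients may be unstable even when overall fit is unchanged.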

Train Lasso Regression Model

To assess whether regularization improves performance, we next fit Lasso, followed by Ridge and Elastic Net. These approaches can reduce overfitting and handle correlated predictors more effectively than OLS. We train each variant with the same repeated cross-validation framework so that their results are directly comparable.

lasso_model <- train(
  x = data$X,
  y = data$y,
  method = "lasso",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
report(lasso_model)
lasso: MAE=0.11, RMSE=0.14, R2=0.37
Best tuned parameters: fraction = 0.9
loess r-squared variable importance

  only 20 most important variables shown (out of 34)

                    Overall
`Mnf Flow`          100.000
`Bowl Setpoint`      59.594
`Filler Level`       51.518
`Usage cont`         49.516
`Pressure Setpoint`  46.216
`Carb Flow`          41.604
`Brand Code`C        39.243
`Pressure Vacuum`    26.382
`Hyd Pressure3`      26.289
`Hyd Pressure2`      22.913
`Fill Pressure`      20.184
`Brand Code`D        16.949
`Oxygen Filler`      14.012
`Carb Rel`           12.635
Temperature          11.965
`Hyd Pressure1`      11.618
`Alch Rel`           10.601
`Hyd Pressure4`      10.396
`Brand Code`A         4.905
MFR                   4.545

Train Ridge Regression Model

ridge_model <- train(
  x = data$X,
  y = data$y,
  method = "ridge",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
report(ridge_model)
ridge: MAE=0.1, RMSE=0.13, R2=0.4
Best tuned parameters: lambda = 0.01
loess r-squared variable importance

  only 20 most important variables shown (out of 34)

                    Overall
`Mnf Flow`          100.000
`Bowl Setpoint`      59.594
`Filler Level`       51.518
`Usage cont`         49.516
`Pressure Setpoint`  46.216
`Carb Flow`          41.604
`Brand Code`C        39.243
`Pressure Vacuum`    26.382
`Hyd Pressure3`      26.289
`Hyd Pressure2`      22.913
`Fill Pressure`      20.184
`Brand Code`D        16.949
`Oxygen Filler`      14.012
`Carb Rel`           12.635
Temperature          11.965
`Hyd Pressure1`      11.618
`Alch Rel`           10.601
`Hyd Pressure4`      10.396
`Brand Code`A         4.905
MFR                   4.545

Train Elastic Net Model

elastic_model <- train(
  x = data$X,
  y = data$y,
  method = "glmnet",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
report(elastic_model)
glmnet: MAE=0.11, RMSE=0.14, R2=NaN
Best tuned parameters: alpha = 0.952631578947368
Best tuned parameters: lambda = 0.000335359611608051
glmnet variable importance

  only 20 most important variables shown (out of 34)

                    Overall
`Mnf Flow`          100.000
`Hyd Pressure3`      62.568
`Brand Code`C        50.353
`Bowl Setpoint`      49.733
`Carb Pressure1`     37.185
Density              36.771
`Alch Rel`           32.715
`Usage cont`         24.122
Temperature          21.838
`Pressure Setpoint`  19.312
`Hyd Pressure2`      16.036
`Oxygen Filler`      15.373
`Brand Code`A        14.679
`Filler Level`       13.963
`Carb Rel`           10.869
`Carb Flow`          10.216
`Carb Volume`         9.254
`Fill Pressure`       7.366
`Fill Ounces`         6.815
`Hyd Pressure1`       6.218

Running the Elastic Net produced a warning that there were missing values in the resampled performance measures. The model finished training and returned valid results, and as we will see below, its metrics are close to those of the other linear models. Since it offers no clear advantage, we will likely not recommend it as our final model.

Comparison of Linear Models

After training and comparing five linear modeling approaches — OLS, PLS, Lasso, Ridge, and Elastic Net — we find that performance across all methods is remarkably similar. Each model was evaluated using repeated 10-fold cross-validation, and results were compared using RMSE (Root Mean Squared Error), R-squared, and MAE (Mean Absolute Error).
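All of these models reuse the same `ctrl` object created during setup, which is what makes their resampled metrics directly comparable. For reference, a repeated 10-fold control object looks like this (a sketch; the repeat count in our actual setup may differ):

```r
# Repeated 10-fold cross-validation shared by the train() calls above.
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,  # 10 folds
  repeats = 5   # hypothetical repeat count; match the value used in Setup
)
```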

The best RMSE and best R-squared values were achieved by OLS and Ridge, but the differences across all five linear models are small (roughly 0.005 RMSE units or less). The lowest MAE was from OLS at 0.104, but again, the margin is extremely small.

Given the similarity in performance, we recommend Ordinary Least Squares (OLS) for this use case:

- It is simple, interpretable, and fast to compute.
- It slightly outperformed PLS and Lasso in both MAE and RMSE.
- Its coefficients can be used directly for business insight and explainability.
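To turn that recommendation into business insight, the fitted coefficients can be pulled straight from the final lm object. A minimal sketch, assuming `ols_model` was trained as above on centered and scaled predictors (which makes coefficient magnitudes comparable):

```r
# Rank OLS coefficients by absolute magnitude; with centered and scaled
# inputs, larger magnitudes indicate stronger influence on predicted pH.
coefs <- coef(ols_model$finalModel)
coefs <- coefs[names(coefs) != "(Intercept)"]
head(sort(abs(coefs), decreasing = TRUE), 10)
```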

Elastic Net showed nearly identical performance but triggered a warning about missing values in the resampled performance measures, suggesting mild instability. If regularization becomes important due to changing data conditions or overfitting concerns in future phases, Lasso or Elastic Net could be revisited.

This modeling exercise confirms that pH can be predicted with consistent accuracy using standard linear models. The choice of model does not significantly alter accuracy, allowing the business to prioritize simplicity and transparency. Implementing the OLS model enables stakeholders to:

- Understand which production variables most influence pH
- Monitor predictions in real time using a straightforward model
- Translate coefficients into actionable guidance for technicians

Non Linear Regression

Refined Preprocessing

data <- preprocess(train_data_pp, c("center", "scale", "medianImpute", "pca"))

KNN Model

k-Nearest Neighbors (KNN) is a non-parametric method used for classification and regression. It predicts the value of a point from its k nearest neighbors; for regression, the prediction is typically the average of their target values.

knn_model <- train(
  x = data$X,
  y = data$y,
  method = "knn",
  tuneLength = 20,
  trControl = ctrl
)
report(knn_model)
knn: MAE=0.1, RMSE=0.13, R2=0.45
Best tuned parameters: k = 9
loess r-squared variable importance

      Overall
PC2  100.0000
PC1   32.9145
PC6   30.1517
PC7   27.3098
PC17  16.6429
PC4   13.9365
PC8   13.1987
PC20  11.9676
PC3    9.8122
PC15   8.7371
PC5    8.3752
PC18   7.8995
PC9    7.2142
PC16   7.1215
PC19   6.2098
PC12   3.1794
PC13   2.5477
PC10   1.5031
PC11   0.5908
PC14   0.0000

SVM Model

Support Vector Machines (SVM) are supervised learning algorithms that can be used for classification or regression. For classification, an SVM finds the hyperplane that best separates the classes; for regression, as here, it fits a function that keeps most observations within a margin of tolerance while penalizing larger deviations.

Linear SVM Model

svml_model <- train(
  x = data$X,
  y = data$y,
  method = "svmLinear",
  tuneLength = 10,
  trControl = ctrl
)
report(svml_model)
svmLinear: MAE=0.11, RMSE=0.14, R2=0.34
Best tuned parameters: C = 1
loess r-squared variable importance

      Overall
PC2  100.0000
PC1   32.9145
PC6   30.1517
PC7   27.3098
PC17  16.6429
PC4   13.9365
PC8   13.1987
PC20  11.9676
PC3    9.8122
PC15   8.7371
PC5    8.3752
PC18   7.8995
PC9    7.2142
PC16   7.1215
PC19   6.2098
PC12   3.1794
PC13   2.5477
PC10   1.5031
PC11   0.5908
PC14   0.0000

Polynomial SVM Model

svmp_model <- train(
  x = data$X,
  y = data$y,
  method = "svmPoly",
  tuneLength = 3,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(svmp_model)
svmPoly: MAE=0.11, RMSE=0.15, R2=0.34
Best tuned parameters: degree = 3
Best tuned parameters: scale = 0.01
Best tuned parameters: C = 1
loess r-squared variable importance

      Overall
PC2  100.0000
PC1   32.9145
PC6   30.1517
PC7   27.3098
PC17  16.6429
PC4   13.9365
PC8   13.1987
PC20  11.9676
PC3    9.8122
PC15   8.7371
PC5    8.3752
PC18   7.8995
PC9    7.2142
PC16   7.1215
PC19   6.2098
PC12   3.1794
PC13   2.5477
PC10   1.5031
PC11   0.5908
PC14   0.0000

Radial SVM Model

svmr_model <- train(
  x = data$X,
  y = data$y,
  method = "svmRadial",
  tuneLength = 5,
  trControl = ctrl
)
report(svmr_model)
svmRadial: MAE=0.09, RMSE=0.12, R2=0.49
Best tuned parameters: sigma = 0.0322972739788172
Best tuned parameters: C = 4
loess r-squared variable importance

      Overall
PC2  100.0000
PC1   32.9145
PC6   30.1517
PC7   27.3098
PC17  16.6429
PC4   13.9365
PC8   13.1987
PC20  11.9676
PC3    9.8122
PC15   8.7371
PC5    8.3752
PC18   7.8995
PC9    7.2142
PC16   7.1215
PC19   6.2098
PC12   3.1794
PC13   2.5477
PC10   1.5031
PC11   0.5908
PC14   0.0000

MARS Model

Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression technique that can be used for both linear and non-linear regression. It works by fitting piecewise linear functions to the data.

grid = expand.grid(
  degree = 1:2,
  nprune = 5:10
)
mars_model <- train(
  x = data$X,
  y = data$y,
  method = "earth",
  tuneGrid = grid,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(mars_model)
earth: MAE=0.11, RMSE=0.14, R2=0.3
Best tuned parameters: nprune = 10
Best tuned parameters: degree = 2
earth variable importance

     Overall
PC2  100.000
PC1   71.570
PC6   71.570
PC7   52.247
PC17  28.489
PC18  24.062
PC13  16.917
PC11   9.624
PC15   9.624
PC16   0.000

Neural Network Model

Neural Networks are a class of models that are inspired by the way the human brain works. They are used for both classification and regression tasks. In this case, we are using it for regression.

grid <- expand.grid(
  size = c(2, 4, 6, 8, 10),
  decay = c(0, 0.05, 0.1, 0.15)
)
nn_model <- train(
  x = data$X,
  y = data$y,
  method = "nnet",
  linout = TRUE,
  trace = FALSE,
  maxit = 1000,
  tuneGrid = grid,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(nn_model)
nnet: MAE=0.11, RMSE=0.17, R2=0.38
Best tuned parameters: size = 6
Best tuned parameters: decay = 0.1
nnet variable importance

     Overall
PC20  100.00
PC6    97.17
PC2    93.23
PC19   82.66
PC7    82.64
PC16   80.55
PC17   68.43
PC4    68.32
PC1    58.24
PC18   54.43
PC10   44.00
PC13   34.40
PC15   33.81
PC8    32.10
PC5    30.84
PC3    28.66
PC11   28.27
PC14   18.44
PC12   10.77
PC9     0.00

Decision Trees

Refined Preprocessing

rpart_data <- preprocess(train_data_pp, "medianImpute")
# column names with backticks are causing issues with the caret package.
rpart_data$X <- clean_names(rpart_data$X)

Recursive Partitioning Model

rpart_model <- train(
  x = rpart_data$X,
  y = rpart_data$y,
  method = "rpart",
  tuneLength = 10,
  trControl = ctrl
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
report(rpart_model)
rpart: MAE=0.11, RMSE=0.14, R2=0.34
Best tuned parameters: cp = 0.0136301352354267
rpart variable importance

  only 20 most important variables shown (out of 34)

                  Overall
mnf_flow          100.000
temperature        88.758
usage_cont         80.417
filler_speed       76.020
balling_lvl        69.764
fill_pressure      51.009
bowl_setpoint      48.788
hyd_pressure3      48.601
filler_level       45.544
air_pressurer      45.273
pressure_vacuum    41.667
brand_code_c       40.155
carb_flow          36.453
pressure_setpoint  32.795
carb_pressure1     32.761
carb_rel           29.118
oxygen_filler      28.459
alch_rel           20.989
brand_code_d       18.941
carb_volume         7.517

Visualize the model

rpart.plot(
  rpart_model$finalModel,
  type = 2,
  extra = 101,
  fallen.leaves = TRUE,
  main = "Decision Tree for pH Prediction"
)

The recursive partitioning model was built using the rpart method, tuning over 10 candidate complexity parameter (cp) values. The optimal cp of 0.0136 balances tree complexity and performance. While the decision tree offers good interpretability, its predictive power was moderate: an R-squared of 0.34 indicates only limited ability to explain the variance in the response variable. The tree visualization helps show how key variables influence the pH prediction.

The top three predictors identified were mnf_flow (100 importance), temperature (88.76), and usage_cont (80.42).

Random Forest Model

rf_data <- rpart_data

# Random forest takes a very long time to run with our default parameters.
# We are cutting down on cross validation
rf_model <- train(
  x = rf_data$X,
  y = rf_data$y,
  method = "rf",
  importance = TRUE,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(rf_model)
rf: MAE=0.08, RMSE=0.11, R2=0.63
Best tuned parameters: mtry = 18
rf variable importance

  only 20 most important variables shown (out of 34)

                Overall
mnf_flow         100.00
brand_code_c      92.93
pressure_vacuum   88.75
oxygen_filler     77.26
balling_lvl       65.47
air_pressurer     61.98
temperature       59.73
alch_rel          58.12
carb_rel          58.11
usage_cont        54.06
density           52.87
hyd_pressure3     51.18
filler_speed      48.47
carb_flow         45.49
carb_pressure1    44.49
hyd_pressure1     43.02
bowl_setpoint     40.53
hyd_pressure2     38.17
fill_pressure     31.83
filler_level      30.07

Visualize the model

varImpPlot(
  rf_model$finalModel,
  type = 1,
   main = "Random Forest Variable Importance"
)

The Random Forest outperformed the decision tree, nearly doubling the explanatory power (R-squared of 0.63). This much stronger predictive performance suggests it is better suited to this dataset, despite longer training time and reduced interpretability.

The top predictor was again mnf_flow (100 importance), consistent with the decision tree, followed by brand_code_c (92.93) and pressure_vacuum (88.75).

Model Comparison

Having used cross-validation to build our models, each is now populated with a $results table of resampled performance statistics. Here we shall take a look at some of those summary statistics and compare one model to another.

model_list <- list(
  PLS = pls_model,
  OLS = ols_model,
  Lasso = lasso_model,
  Ridge = ridge_model,
  ElasticNet = elastic_model,
  KNN = knn_model,
  SVM_linear = svml_model,
  SVM_poly = svmp_model,
  SVM_radial = svmr_model,
  MARS = mars_model,
  NeuralNet = nn_model,
  RPart = rpart_model,
  RandomForest = rf_model
)

# extract the results from each model
model_results <- tibble(
  Model = names(model_list),
  MAE = map(
    model_list, function(x) mean(x$results$MAE, na.rm = TRUE)
  ) |> as_vector(),
  RMSE = map(
    model_list, function(x) mean(x$results$RMSE, na.rm = TRUE)
  ) |> as_vector(),
  Rsquared = map(
    model_list, function(x) mean(x$results$Rsquared, na.rm = TRUE)
  ) |> as_vector()
)
model_results |>
  arrange(desc(Rsquared), MAE, RMSE) |>
  mutate(
    MAE = round(MAE, 3),
    RMSE = round(RMSE, 3),
    Rsquared = round(Rsquared, 3)
  ) |>
  kable()
Model MAE RMSE Rsquared
RandomForest 0.080 0.108 0.626
SVM_radial 0.093 0.125 0.485
KNN 0.099 0.128 0.455
OLS 0.104 0.134 0.396
Ridge 0.104 0.134 0.396
NeuralNet 0.109 0.171 0.384
PLS 0.106 0.136 0.380
Lasso 0.108 0.138 0.371
ElasticNet 0.109 0.139 0.370
RPart 0.111 0.141 0.340
SVM_poly 0.113 0.148 0.338
SVM_linear 0.110 0.141 0.337
MARS 0.114 0.144 0.302
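Averaging over each model's full $results table mixes performance across all tuning candidates. As a cross-check, caret's resamples() compares only the best tune of each model, fold by fold. A sketch, assuming the models listed share the same resampling scheme (those trained with the reduced 2-fold control would need a separate comparison):

```r
# Fold-by-fold comparison of the best tune from each model.
# Only models trained with the same trainControl (ctrl) are directly comparable.
res <- resamples(list(
  OLS = ols_model, PLS = pls_model, Lasso = lasso_model, Ridge = ridge_model
))
summary(res)                   # min/median/max of MAE, RMSE, R-squared per model
bwplot(res, metric = "RMSE")   # boxplots of resampled RMSE
```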

The Champion

“The judges have decided. A champion, ideally, is a model with the lowest MAE, the lowest RMSE, and the highest R2. And we have such a champion. Please allow me to present the champion of our modeling competition: Random Forest. It is only moderately interpretable and not especially fast to compute, but it wins on every metric.”

Among all tested models, the Random Forest demonstrated the best overall performance, achieving the lowest MAE (0.080), lowest RMSE (0.108), and highest R-squared (0.626), indicating superior predictive accuracy and explanatory power.

The SVM radial and KNN models followed, with moderate performance (R-squared of 0.485 and 0.455, respectively), but did not match the precision of Random Forest. Linear models such as OLS, Ridge, and Elastic Net showed similar results, with R-squared values clustered around 0.37–0.40.

The Neural Network, PLS, Lasso, and RPart (Decision Tree) models exhibited slightly weaker performance, with R-squared values below 0.40. The MARS model had the lowest performance overall (R-squared of 0.302).

In conclusion, Random Forest is the recommended model for this dataset due to its balance of accuracy and robustness, despite being more complex and computationally intensive.

Final Predictions

Finally, we will make predictions using our champion model, Random Forest. We will use the test data to make predictions and save them to a CSV file.

# preprocess the test data with the same preprocessing model used on the training data
df <- predict(rf_data$PPM, test_data_pp)
# the variable names in our training set caused problems with `randomForest`, so we
# "cleaned" them; the model expects the same cleaned names here
df <- clean_names(df)
data.frame(
  SampleID = 1:nrow(df),
  Predicted_pH = predict(rf_model, newdata = df)
) |>
  write.csv(
    "Final_Predictions.csv",
    row.names = FALSE
  )